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CN ables and defining a mean pattern between a sample of random events. Using 
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p-H an estimation of the mean distribution. Moreover, when the distributions are a com- 

mon measure warped by a centered random operator, then the barycenter enables 
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Keywords: Wasserstein Distance; Template estimation; Classification, 
e-mail: boissard, le-gouic, loubes@math.univ-toulouse.fr 

> 

£1 1 Introduction 

On 
in 

Giving a sense to the notion of mean behaviour may be counted among the very early 
i-H activities of statisticians. When confronted to large sample of high dimensional data, the 

usual notion of Euclidean mean is not usually enough since the information conveyed by 
the data possesses an inner geometry far from the Euclidean one. Indeed, deformations on 
the data such as translations, scale location models for instance or more general warping 
procedures prevent the use of the usual methods in data analysis. The mere issue of 
defining the mean of the data becomes a difficult task. This problem arises naturally for 
a wide range of statistical research fields such as functional data analysis for instance in 
[12], [21] and references therein, image analysis in [25] [24] or jl], shape analysis in [T7] 
or |14j with many applications ranging from biology in [8] to pattern recognition [22] just 
to name a few. 

Without any additional knowledge, this problem is too difficult to solve. Hence to 
tackle this issue, two main directions have been investigated. On the one hand, some 
assumptions are made on the deformations. Models governed by parameters have been 
proposed, involving for instances scale location parameters, rotations, actions of parame- 
ters of Lie groups as in [7] or in a more general way deformations parametrized by their 
coefficients on a given basis or in an RKHS set [2J. Adding structure on the deforma- 
tions enables to define the mean behaviour as the data warped by the mean deformation, 
i.e the deformation parametrized by the mean of the parameters. Bayesian statistics or 
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semi-parametric enable to provide sharp estimation of these parameters. However, the 
consistency of the estimator remains a theoretical issue for many cases. 

On the other hand, another direction consists in finding an adequate distance between 
the data which reveals the information which is conveyed. Actually, the chosen distance 
depends on the nature of the set where the observations belong, whose estimation is a 
hard task. We refer for instance to [5J or [20J for some examples. Once an appropriate 
distance has been chosen, difficulties arise when trying to define the mean as the minimum 
of the square distance since both existence and uniqueness rely on assumptions on the 
geometry of the data sets as pointed out in [6]. This will be the framework of our work. 

Assume that we observe j = 1, . . . , J samples of % = 1, . . . , n independent random 
variables Xij G lR d with distribution fij. We aim at defining the mean behaviour of these 
observations, i.e their mean distribution. For this we will extend the notion of barycenter 
of the distributions with respect to the Wasserstein distance defined in [lj to the empirical 
measures and prove the consistency of its estimate. Moreover, we will tackle the case where 
the distributions are the images of an unknown original distribution by random operators 
under some suitable assumptions. In this case, we prove that an iterative version of the 
barycenter of the empirical distributions provides an estimate which enables to recover 
the original template distribution when the number of replications J is large enough. 

The paper falls into the following parts. Section [2] is devoted to the extension of the 
notion of Barycenter in the Wasserstein space for empirical measures. In Section 3.2, we 
consider a modification of the notion of barycenter by considering iterative barycenters, 
which have the advantage to enable to recover the distribution pattern as proved in 
Section |4j Finally some applications for real data case are pointed out in Section [5j 



2 Barycenters in the Wasserstein space: Notations and 
general results 

Let (E, d, Q) denotes a metric measurable space. The set of probability measures over E 
is denoted by V(E). Given a collection of probability measures /ii, . . . , fij over E, and 
weights Ai, . . . , A j G R, Xj > 0, 1 < j '< J, J2j=i — 1> there are several natural ways 
to define a weighted average of these measures. Perhaps the most straightforward is to 
take the convex combination of these measures 

j 

— / ] Xjfij, 

i=i 

using the fact that probability measures form a convex subset of the linear space of 
finite measures. However, if we provide V(E) with some metric structure, the definition 
above is not really appropriate. 

We denote by V<i{E>) the set of all probability measures over E with a finite second- 
order moment. Given two measures fi, v in V(E), we denote by V(fi,v) the set of all 
probability measures 7r over the product set E x E with first, resp. second, marginal fi, 
resp. v. 



2 



The transportation cost with quadratic cost function, or quadratic transportation cost, 
between two measures /i, v in V^E), is defined as 

72(/U, v) = inf / d(x,y) 2 dTT. 

7r e r(n,v) J 

The quadratic transportation cost allows to endow the set of probability measures 
(with finite second-order moment) with a metric by setting 

w 2 ( f i,u) = r 2 (fi,u) 1 / 2 . 

This metric is known under the name of 2-Wasserstein distance. 

In Euclidean space, the barycenter of the points x±, . . . , Xj with weights Ai, . . . , Xj, 
Xj > 0, J2j=i = 1, is defined as 

J 
i=i 

It is also the unique minimizer of the functional 

J 

V ^ E(y) = ^Xjlxj -y\ 2 . 

3=1 

By analogy with the Euclidean case, we give the following definition for Wasserstein 
barycenter, introduced by M. Agueh and G. Carlier in pQ. 

Definition 2.1. We say that the measure [i G Vi{E~) is a Wasserstein barycenter for the 
measures fj,i, . . . , \ij G Vi[E) endowed with weights Ai, . . . , Xj, where Xj > 0, < j < J, 
and ^2j = i Xj — 1, if fJ, minimizes 

J 

3=1 

We will write 

Hb{X) = Bar((/i i ,A^)i^ 3 -<j). 

In other words, the barycenter is the weighted Frechet mean in the Wasserstein space. 
In [1], the authors prove that when E — R d and the measures fij, 1 < j < J satisfy suitable 
assumptions, the barycenter exists and is unique. For example, a sufficient condition is 
that one of the measures fij admits a density with respect to the Lebesgue measure. They 
also provide a problem that is the dual of the minimization of the functional E defined 
above, as well as characterizations of the barycenter. 

Next, we recall a version of Brenier's theorem on the characterization of quadratic 
optimal transport in M. d . Throughout all the paper we will use the following notation. 

Definition 2.2. Let E, F be measurable spaces and fi G V(E). Let T : E — > F be a 
measurable map. The push-forward of /i by T is the probability measure T#fi G V(F) 
defined by the relations 
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T#fj,(A) = /i(T _1 (yl)), A C F measurable. 

Hence Brenier's theorem can be stated as follows. 

Theorem 2.1 (Brenier's theorem, see [9]). Let fi,u G T^O^) be compactly supported 
measures, with \x absolutely continuous w.r.t. Lebesgue measure. Then there exists a 
/x-a.e. unique map T : Mr — > M. d such that 

• T # fi = v, 

• WZ(^v) = f Rd \T(x)-x\ 2 ^dx). 

Moreover, there exists a lower semi- continuous convex function if : IR d — > M such that 
T = V<f fi-a.e., and T is the only map of this type pushing forward fi to v, up to a 
^-negligible modification. The map T is called the Brenier map from ft to v. 

As observed in [I], the barycenter of two measures is the interpolant of these two 
measures in the sense of McCann. 

Proposition 2.2 (See [T], Section 6.2). Let e P20R d ) be absolutely continuous w.r.t. 
Lebesgue measure. Let T : lR d ->■ R d denote the Brenier map from \x to v. The barycenter 
of (/i, A) and (u, 1 — A) is 

ft X = (\Id+(l-\)T) #f i. 
This provides a natural expression for the barycenter of measures. 

3 Estimation of Barycenters of empirical measures 

Assume we do not observe the distributions /ij's but approximations of these distributions. 
Let /i™ G VzfJRL ) for 1 < j < J be these approximations in the sense that they converge 
with respect to Wasserstein distance, i.e W^ft™ , ftj) — > when n — > +oo. Our aim is to 
study the asymptotic behaviour of the barycenter of the /x"'s when n goes to infinity. 

3.1 Consistency of the approximated barycenter 

We are interested here in statistical properties of the barycenter of the . . . , /j/j. We 
begin by establishing a consistency result in Wasserstein topology. 

Theorem 3.1. Let J > 1, and for every n > 0, let fi™ G T^QE^)? 1 < j < J, be measures 
absolutely continuous with respect to the Lebesgue measure. Let Ai,...,Aj be positive 
weights. Let 

jl n (\) = Bar((ii],\ j ) 1 < j <j). 
Let fix, . . . G ) be absolutely continuous w.r.t. Lebesgue measure, and let 

yU B (A) = Bar[{iJ lj ,\j)x<j<j)- 

then when n — > +oo 
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3.2 An Iterative version of barycenters of measures 

Barycenters in Euclidean spaces enjoy the associativity property : the barycenter of 
X\,X2,X3 with weights Ai,A2,A3 coincides with the barycenter of £12, £3 with weights 
Ai + A2,A3 when X12 is the barycenter of Xi,x% with weights Ai,A2- This property, as 
we will see, no longer holds when considering barycenters in Wasserstein spaces over 
Euclidean spaces, with the notable exception of dimension 1. 

Therefore we introduce a notion of iterated barycenter as the point obtained by suc- 
cessively taking two-measures barycenters with appropriate weights. This does not in 
general coincide with the ordinary barycenter. However, we will identify cases where the 
two notions match. 

Definition 3.1. Let /i; E T > 2(E), 1 < i < n, and Aj > 0, 1 < i < n with Y17=i ^» = 1- 
The iterated barycenter of the measures /ii, . . . , /i n with weights Ai, . . . , A n is denoted by 
IB((/Xj, Aj)i<j< n ) and is defined as follows : 

• IB((/i x , A x )) = //!, 

• IB((^,A i ) \<i<n) 

= Bar [(IB((/i<, A*) l<i<n-l)i Al + • • • + A n _i), A n )] 

Remark. Iterated barycenters are well-suited to computations, since there exist efficient 
numerical methods to compute McCann's interpolant, see e.g. [15], |23j . Moreover, as 
we will see later, in some cases of interest the iterated barycenter does not depend on 
the order in which two-measures barycenters are taken, allowing for parallel computation 
schemes. 

The next proposition establishes consistency of iterated barycenters of approximated 
measures /i™, for j = 1, . . . , J. 

Theorem 3.2. The iterated barycenter is consistent : if /1™ — > fij in W2 distance for 
j = 1, . . . , J, then 

\j)i<j<j) IB((fMj, \j)i<j<j) 

in W2 distance. 

4 Deformations of a template measure 

We now would like to use Wasserstein barycenters or iterated barycenters in the following 
framework : assume that we observe probability measures . . . that are deformed 
versions, in some sense, of an original measure /i. We would like to recover /1 from the 
observations. Here, we propose to study the relevance of the barycenter as an estimator 
of the template measure, when the deformed measures are of the type fij = Tjj.fi for 
suitable push-forward maps Tj. 

Our aim here is to extend the results of J.F. Dupuy, J.M. Loubes and E. Maza in |1U| . 
They study the problem of curve registration, that we can describe as follows : given an 
unknown increasing function F : [a, b] 1— > [0, 1], and a random variable H with values in 
the set of continuous increasing functions h : [a, b] 1— > [a, b], we observe Foh^ 1 , . . . , Foh~ l 
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where hi are i.i.d. versions of H (randomly warped versions of F). Let /i G V(R) denote 
the probability measure that admits F as its c.d.f. : then the above amounts to saying that 
we observe hi#n, 1 < % < n. The authors build an estimator by using quantile functions 
that turns out to be the Wasserstein barycenter of the observed measures. They show 
that the estimator converges to (EH) 

Hereafter, we first define a class of deformations for distributions, which are modeled 
by a push forward action by a family of measurable maps Tj, j — 1, . . . , J undergoing the 
following restrictions. Such deformations will be called admissible. 

4.1 Admissible deformations 

Definition 4.1. The set GCF(Q) is the set of all gradients of convex functions, that is to 
say the set of all maps T : — > R n such that there exists a proper convex l.s.c. function 
: -> R with T = V0. 

Definition 4.2. We say that the family (Tj)j g / of maps on Q is an admissible family of 
deformations if the following requirements are satisfied : 

1. there exists i G I with T io = Id, 

2. the maps Tj : Q — > Q are one-to-one and onto, 

3. for i,j El we have T, L o Tf 1 e GCF(Q). 

The following Proposition provides examples are of such deformations. 

Proposition 4.1. The following are admissible families of deformations on domains of 
R n . 

• The set of all product continuous increasing maps on W l , i.e. the set of all maps 

T :x^{F 1 {x 1 ) ) ...,F n {x n )) 

where the functions Fi : R — > R are continuous increasing functions with Fi — >_ 00 
-oo, Fi -> +00 +oo. 

In particular, this includes the family of scale-location transformations, i.e. maps 
of the type x i->- ax + b, a > 0, b G R n . 

• The set of radial distorsion transformations, i.e. the set of maps 

T:R n ^R n , x^F(\x\)^ 

FI 

where F : R + h-> R + is a continuous increasing function such that F(0) = 0. 

• The maps t G o Tj o G where (Tj)j g / is an admissible family of deformations on Q 
and G G O n is a fixed orthogonal matrix. This family has l G(Q) as its domain. 
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Proof of Proposition 4.1 



Proof. Let us consider the first family. Checking the two first requirements is straight- 
forward and we only take care of the last one. Let S : x h- > (Fi(x\), . . . , F n (x n )) and 
T : x (Gi(xi), . . . , G n (x n )). The map S o T _1 is given by 

S o T-\x) = (F x o G^ 1 (x 1 ), ...,F n o G-\x n )) , 
and this is the gradient of the function 

x^ F 1 oGi 1 (z)dz + ...+ / F n oG- 1 (z)dz. 
Jo Jo 

The functions Fi o G^ 1 are increasing, so that their primitives are convex functions, 
which makes the function above convex. 

Second point : observe that radial distortion transformations form a group, so that we 
only need show that each such transformation is the gradient of a convex function. And 
indeed, T : x i— > F(\x\) ^ is the gradient of the function 

>\x\ 



rm 

x i — y I F(r)dr 
Jo 



and this is a convex function because F is increasing. 

The final item is a simple consequence of the observation that if G G GL n and / : 
E n ->■ R is differentiable, then V(/ oG) = f GoV/oG. □ 

4.2 Barycenter of measures warped using admissible deforma- 
tions 

We are interested in recovering a template measure from deformed observations. The un- 
known template is a probability measure fi on the domain Q C M d , absolutely continuous 
w.r.t. the Lebesgue measure A. We represent the deformed observations as push-forwards 
of fi by maps T : Q — > Q, i.e. we observe (Tj)#fi, j = 1, . . . , J. 



Theorem 4.2 states that when T~ belongs to an admissible family of deformations, tak- 



ing the iterated barycenter of the observations corresponds to averaging the deformations. 

Theorem 4.2. Assume that {Ti) ie j is an admissible family of deformations on a domain 
Q C M n , and let fi G V^Sl), fi << A. Let fij = (Tj)#fi. The following holds : 

J 

I B i(Vj,^)i<j<j) = (%2XjTj)#iM. 

3=1 

With this explicit expression at hand, we can check that in the case described above, 
the iterated barycenter coincides with the usual notion of barycenter. 

Proposition 4.3. Let fij = (Tj)#fi, where (Tj)j g / is an admissible family of deformations 
and fi << A. Then 



IB ((^Aj)i<j<j) = Bar((fij,\ 



■j)i<j<J) 
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Remark. 



1. The special case of the dimension 1 

In dimension 1, the set of all continuous increasing maps is an admissible family of 
deformations. The previous theorem applies for this very large class of deformations. 
Results in this case are known from [10] or [11] : the only new part here is that the 
estimator can be computed iteratively. 

2. Barycenters and iterated barycenters do not match in general. 

The fact that the two notions of barycenter introduced above coincide no longer 
holds as soon as the dimension is larger than 2. For a counterexample, consider 
the case of non-degenerate centered Gaussian measures 71, . . . , jj on W 1 , defined by 
their covariances matrices Si , . . . Sj G ■ 

According e.g. to [19J, Example 1.7, the optimal transport map from J\f(0,S) to 
jV(0, T) is given by 

x T l/2 (T l/2 5T l/2 ) -l/ 2T l/2 x _ 

From this result, it is possible to give an explicit expression of the iterated barycen- 
ter. 

On the other hand, according to Theorem 6.1 in [T], the barycenter of the fij with 
weights 1/J is the Gaussian measure with covariance matrix the unique positive 
definite solution of the fixed point equation 

i=^(mV /2 ) 1/2 ' 

J 3=1 

One may check that these two covariance matrices do not match in general. 

4.3 Template Estimation from admissible deformations 

Thanks to Theorem |4.2[ we can study the asymptotic behaviour of the barycenter when 
the number of replications of the warped distributions J increases. Actually, we prove 
that the barycenter is an estimator of the template distribution. 

Let T be a process with values in some admissible family of deformations acting on a 
subset 1 cR d . 

T: Q T(J) 
w (-)■ T(w,-), 

where (fl, A, P) is an unknown probability space, Assume that T is bounded and has a 
finite moment <£>(.) = E(T(.)). Let Tj for j = 1, . . . , J be a random sample of realizations 
of the process T. Then, we observe measures fij which are warped by Tj in the sense that 
for all , jij = Tjnfi. 
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Theorem 4.4. Assume that fi is compactly supported. As soon as tp = id, the barycenter 
ji B is a consistent estimate of ji when J tends to infinity in the sense that a.s 



Moreover, assuming that \\T — id\\i,2 < M a.s., we get the following error bounds : 
F(W 2 (n B ,pi) >e)< 2exp-J- 



e 2 



M 2 (l + ce/M)' 

Note that when the warping process is not centered, the problem of estimating the 
original measure fi is not identifiable and we can only estimate by the barycenter \xb the 
original measure transported by the mean of the deformation process, namely y?#/i. 

The proof of this theorem relies on the following proposition. 

Proposition 4.5. Let (Ti) ie j be an admissible family of deformations on a domain Q C 
W 1 , and let /i G V2^), fi « \. Let fij = (Tj)#[j,. Denote by hb the barycenter with 
equal weights 1/J. For every v in T^C^), we have 



J'=l 

where T u is the Brenier map from fi to v. 

Proof. With the explicit expression of the barycenter, we know that the Brenier map from 
fi to hb is 1/ J^2j =1 Tj, which implies that 

1 3 

is a coupling of \1b and v. Consequently, 

J 



W 2 2 ^b^)< I \- T Y^T {x)-T v {x)\ 2 ^d 

J J .7=1 



x). 

□ 



5 Statistical Applications 

5.1 Distribution Template estimation from empirical observations 

In many situations, the issue of estimating the mean behaviour of random observations 
play a crucial to analyze the data, in image analysis, kinetics in biology for instance. 
For this, we propose to use the iterative barycenter of a smooth approximation of the 
empirical distribution as a good estimate of the mean information conveyed by the data. 
Moreover, this estimate has the advantage that if the different 
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Assume we observe j = 1, . . . , J samples of i = 1, . . . , n points X^j G M, d which are 
i.i.d realizations of measures fij. Hence we observe cloud points or in an equivalent way 
V- ] = n 5^r=i ^Xij empirical versions of the measures fij. It is well-known that considering 
the mean with respect to the number of samples J of all observation points does not 
provide a good model of the mean behaviour. Instead we here consider the iterative 



barycenter [lb = IB(fij,j) defined in Definition 3.1 The following proposition shows 
that the regularization of the empirical distributions provides a consistent estimator of 
the true barycenter of the corresponding distributions. 

Proposition 5.1. Let 7 £ denote a A/"(0, el^) measure. Set 

fi] = fi]* 1\ in- 



Set fi B ,J = Barlfi™, j). As n — >■ +oo, we have 



n,J 

fi B — > fi B - 



Moreover, if 'the observations X_j ~ fij are warped from an unknown template distribution 
fi by a centered admissible deformation process, hence fi n B is a consistent estimate of fiB, 
in the sense that when n — > +oo and J — > +oo, we get 

\hi — ^ V-b m W2 distance. 

We point out that we have used here a Gaussian kernel regularization of the empirical 
measures fij. Actually, this regularization is needed in order to obtain the existence of the 
barycenter of the data. Note that any other regularization scheme may be used as soon as 
the corresponding measures converge to the true measures in Wasserstein distance when 
n goes to infinity. 

In particular, kernel estimates can be used in this framework. For instance, if there 
exist for all j = 1, . . . , J, a density with respect to A, the Lebesgue measure on IR d , fj 
such that, -^x — fji on e m ay use a kernel estimation of the density of the data. Let K 
be a multidimensional kernel in M. d . Let fj n = -j^Y17=i^-h(-i^i,j) be an estimator of 
the density fj. In this case, set fij^ n the distribution such that = fj >n . Let h = h n 
goes to zero. In this case, we clearly have ^(/^j.n; AO when n goes to infinity. Hence 
previous Proposition entails the consistency of the iterate barycenter of the /ij in 's. 

An important application is given by the issue of ensuring equality between the can- 
didates in an exam with several different referees. This constitutes a natural extension of 
the work in [10] to higher dimensions. 

Consider an examination with a large number of candidates, such that it is impossible 
to evaluate the candidates one after another. The students are divided into J groups, and 
J boards of examiners are charged to grade these groups: each board grades one group 
of candidates. The evaluation is performed by assigning p scores. The m different boards 
of examiners are supposed to behave the same way, so as to respect the equality among 
the candidates. Moreover it is assumed that the sampling of the candidates is perfect in 
the sense that it is done in such a way that each board of examiners evaluates candidates 
with the same global level. Hence, if all the examiners had the same requirement levels, 
the distribution of the ranks would be the same for all the boards of examiners. Here, 
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we aim at balancing the effects of the differences between the examiners, gaining equity 
for the candidates. The situation can be modeled as follows. For each group j among J 
groups of candidates, let X J = {Xf G M. p , i = 1, . . . , n} denote the scores of the students 
within this group. Let fij and [i^ n be respectively the measure and the empirical measure 
of the scores in the j-th group. 

We aim at finding the average way of ranking, with respect to the ranks that were 
given within the p bunches of candidates. For this, assume that there is such an average 
measure, and that each group-specific measure is warped from this reference measure by 
a random process. A good choice is given by the barycenter measure In order to obtain 
a global common ranking for the N candidates, one can now replace the p group- specific 
rankings by the sole ranking based on barycenter measure. Indeed each measure can be 
pushed towards the barycenter. As a result, we obtain a new set of scores for the N 
candidates, which can be interpreted as the scores that would have been obtained, had 
the candidates been judged by an average board of examiners. 

5.2 Discriminant analysis with Wasserstein distance 

Once we have succeeded in defining a mean of a collection of distributions, then the 
second step consists in trying to differentiate the different experiments with respect to 
this average distribution. For this consider Sj, j = 1, . . . , J the transport plan between 
the /ij's and \xb and write fij = Sj^^b- Clustering the experiments in order to build 
coherent groups is usually achieved by comparing a distance between these distributions. 
Here by choosing the Wasserstein distance we get that 

Wi(nB,Hj) = J \Sj(x) -x\ 2 dfx B = \\Sj - id|||2 ( ^ s) . 

Hence statistical analysis of the distributions /ij's amounts to clustering their Wasserstein 
square distance \\Sj — id||^ 2( .^ G K + , which can be easily achieved by any clustering 
methodology. 

Moreover PCA analysis can be conducted by generalizing the ideas of the usual PCA 
on Euclidean space. To extend the framework, one replaces the principal component 
directions with principal component curves from a suitable family of curves, e.g. geodesies. 
Following this idea, classical principal component analysis has been extended to situations 
such as manifolds, Kendall's shape spaces, and functional settings, see |21j . 

It is known that the Wasserstein metric endows the space of probability measures 
with a formal Riemannian structure, in which it is possible to define geodesies, tangents 
spaces, etc., see [16j, |3j. We propose here a method of principal component analysis using 
Wasserstein distance based on geodesies of the intrinsic metric for the one dimensional 
case, which follows the ideas developed in |16| . For this, consider a geodesic segment 7 at 
point [L with direction T, which can be written as 

Vt G [0, 1], 7 (t) = ((1 - t)Id + tT)#fi. 

We extend the definition of 7 to every t el, with the important provision that 7 is in 
general not a geodesic curve for the whole range of t G R. We perform PCA with respect 



11 



to this family of curves which we somewhat abusively refer to as "geodesic curves" on their 
extended range. We will come back to this discussion at the end of our analysis. 
For every fi, the natural distance to the geodesic curve 7 is given by 

d 2 ^, 1 ) = miW 2 (^ 1 (t)). 

Hence for any //j, j = 1, . . . , J let Fb and respectively Fj be the distribution functions of 
the barycenter measure //g and respectively the measures /ij, then define 

d 2 ^, 7) = mf J [((1 - t)Id + tT) o F B l - Fr 1 ] 2 dt. 

A geodesic 71 is called a first component geodesic to the fi/s if it minimizes the following 
quantity 

1 J 

J 3=1 

Then we call a geodesic 72 minimizes ([I]) over all geodesies that have at least one point in 
common with 71 and that are orthogonal to 71 at all points in common. Every point /j,* 
that minimizes \i 1— > j Y2j=i W2 (/^j) /-0 over a ^ common points of 71 and 72 will be called a 
principal component geodesic mean. Given the first and the second principal component 
geodesies 71 and 72 with principal component geodesic mean /i* we say that a geodesic 
73 is a third principal component geodesic if it minimizes ([!]) over all geodesies that meet 
previous principal components orthogonally at //*. Analogously, principal component 
geodesies of higher order are defined. 

Here we will focus on the computation of the first geodesic component 71. We first define 
a geodesic starting at a measure [i with distribution function F and directed by their 
increasing function T as 7(A) = ((1 — t)Id + tT)#fi. Hence, in that case, the distance of 
any measure fij with respect to such a geodesic can be written as 

d 2 ( H , 7) = mf J [((1 - t)Id + tT) o F- 1 - S j o F B '] 2 dx 

n qoF -x F -i|,2 <S J oF B '-F-\(S J -Id)oF^ > 2 
-\\SjoF B -F || ||(T-/rf)oF-T 

where |.| denotes the usual L 2 norm with corresponding scalar product < ., . >. Finally 
PCA with respect to Wasserstein distance amounts to minimizing with respect to T and 
F the following quantity 



T^]TdV„7) 



?1 (T-W)oF- 1 



■J^WSjoF^-F-r-^KSjoF^ 
i=i i=i 



|| (T- Id) oF- 1 ] 



> 



This optimization program can not be solved easily. Note that it requires to maximize in 
T the quantity 

J- ^ / I < O _b R — t , 777— — : ., > 

J B ' \\(T-Id) oF- 1 ! 



i=i 
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If we set v = (T — Id) o F 1 , this maximization can be written as finding the solution to 

J 

arg max N | < Sj o F^ 1 — > | 2 , 

v, IMI=i " ' 



which corresponds to the functional principal component analysis of the maps SjoFg , j = 
1, . . . , J where F plays the role of the mean of the data. Hence, in this framework, choosing 
F equal to Fb is a natural approximation of the solution of the initial optimization 
program. Then the final PCA is conducted by considering geodesies starting at \ib with 
direction found by maximizing 



T^y<(S j -I£ S oF B , w - J — [] 

J 

|2 



> 



|2 



I < Sj - Id, T - Id > 

3=1 



which corresponds to a functional PCA in y This analysis can be achieved using 

tools defined for instance in [21] . Finally, if we get the map corresponding to the 
first functional principal component, the corresponding principal geodesic if obtained by 
setting 7W = ((1 — t)Id + tT^)^HB- The other principal components can be computed 
using the same procedure. 

Let us come back to the caveat that the curves chosen are not Wasserstein geodesies 
on the entire parameter range. It is easy to check (see [3]) that a curve 7(t) = ((1 — 
t)Id + tT)#/j, is a geodesic curve for all t £ R such that (1 — £)Id + tT is an increasing 
function. Assuming T" takes values in the interval [a, b], < a < 1 < b, this means that 
7 is a geodesic curve for all t 6 [l/(a — 1), 1/(6 — 1)]. Once the analysis above yields the 
expression of TW and the t* minimizing d 2 (fij, r y), it is possible to check whether they fall 
in this range. Actually, 

<F^-F B \{T^-Id)oF B 1 > 
3 \\(T (1) - Id) o F^W 2 

Hence, when the measures fij are not too far from their barycenter (i.e. when the \\Sj — 
Id I loo are small) these conditions are met. 



6 Appendix 

Proof of Theorem 13.11 
Proof. Let 

J 

T : (xi, ■ ■ ■ , Xj) i-)- } u XjXj. 

3=1 
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Following [lj, we call 7 G V(M> dx J ) a solution of the multi-marginal problem associated 
with fix, . . . , fij if it is a minimizer for the functional 



)| )*({dxx,...,dxj) 

among all measures 7 G "P(M dx ) with marginals fix, ... , fij. Theorem 4.1 in [Ij, quot- 
ing from W.Gangbo and A.Swiech [T3], shows that when the fij are absolutely continuous 
w.r.t. Lebesgue measure the multi-marginal problem has a unique solution 7 (actually 
absolute continuity of the measures is more than is required in the theorem). Moreover 
(Proposition 4.2 in pQ) the barycenter of the fij is obtained as fiB = T#j. 

For every n > 1, we associate to fij, 1 < j < J, the solution 7" of the multi- 
marginal problem. We also denote by 7* the solution of the multi-marginal problem 
w.r.t. fi x ,...,fij. 

We show that the sequence 7™ is weakly tight. Let B\, . . . , Bj be large balls in IR d , we 
have 



7 n ((5i x . . . B,j) c ) = 7 n (u/ =1 £ x...xExBjXE...xE) 

j 

< V 7 n (E x...xExBjxE...xE) 
3=1 

3=1 

Tightness of the sequences /i™ guarantees tightness of 7™. If convergence of n > 1, 
holds in Waserstein distance, we also recover tightness of 7™ in Wasserstein topology. 
Indeed, the second moments are bounded as they form a convergent sequence : 



j \ x \ 2 d~i n = [ \xj\ 2 dfi] 
J j=i J 



]\Xj\*dflj. 

The above implies tightness of the sequence of barycenters : indeed, it is the push- 
forward of the tight sequence (7™) n >i by the application T : IR dxJ — > IR d , which is Lipschitz 
continuous (with Lipschitz constant bounded by 1). It is readily checked that this opera- 
tion preserves tightness, as it preserves convergence (in weak and Wasserstein topologies). 

We conclude by showing that any limiting point fi^ is a minimizer for the barycenter 
problem associated with fix, . . . ,fij, and by invoking the uniqueness of the barycenter. 
Since fi n is the barycenter for /i™, . . . ,fi n j, we have 
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Assume now that up to a subsequence, fi n — > fi^ in Wasserstein distance, and let 
n — > +00. Since W 2 is weakly lower semi-continuous, we get 

j j j 

^Aj-WfOUMi) < liminf ^XjWKjjT,^) < liminf ^ A^V, 
j=i j=i j=i 

The right-hand side converges to the value X//=i AjWf (/**) Mi)? which is minimal by 
definition. This shows that the inequalities are equalities and it concludes the proof. 

□ 



Proof of Theorem 13.21 

Proof. One sees from the definition that it is sufficient to prove the result for two measures, 
because then the result may be obtained by recurrence. Consider then [i n — > // and 
Vn — > v (convergence is understood in Wasserstein topology), and fix t G (0,1). The 
Brenier transport map between fi and v (resp. fi n and u n ) will be denoted by T (resp. 
T n ). Also denote by //, resp. //* the point in the Wasserstein geodesic between \x and v 
(resp. jj, n and v n ) at time t, that is to say 



/i* = ((l-t)Id + tT) # /i (2) 
f i t n = ((l-t)Id + tT n ) #f i n . (3) 

As noted earlier, /i* is the barycenter of \i and v with weights 1 — t and t, so that we 
need only prove weak continuity of (fi, v) (-> [/. We first take care of weak convergence, i.e. 
we assume that fi n — ^ fi, v n — ^ v and we show that fi f n — > /A The measure tt G P(M n x M n ) 
defined by tt = (IdxT)#/z is the unique optimal transport plan between fi and v. Likewise, 
iT n = (Id x T n )#fj, n is the optimal transport plan between \i n and v n . Now, Theorem 
5.20 in [26] (stability of transport plans) ensures that ir n weakly converges to tt. Let 
ft : W 1 x M. n — >■ IR n be defined by f t (x,y) = (1 — t)x + The map / t is continuous, so 
that pushing forward by f t is a weakly continuous mapping. Therefore, ft#x n ft# n - ^ 
only remains to check that in fact, ft#K n = and /t#7r — 

We now look at the convergence in W2 distance. Observe that the transport plans 
converge for the W2 metric over "P 2 (IR n x W 1 ) : to see this, we use the fact that convergence 
in W2 topology is equivalent to weak convergence plus convergence of second moments. 
And indeed, as we noted, TT n — ^ tt, and on the other hand, 

J \(x,y)\ 2 dTT n (x,y) = J (\x\ 2 + \y\ 2 ) dir n {x,y) 

\x\ 2 dfi n {x) + / \y\ 2 dv n (y) 



x\ 2 d[i(x) + / \y\ 2 dv(y) 
\(x,y)\ 2 dTT. 
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Now we observe that $ n = ft#x n and fi l = ft#K. But ft is Lipschitz, and it suffices to 
use the fact that W 2 (f t# ^n, ft#n) < \\ft\\upW 2 (7r n , vr). 

□ 

Proof of Theorem 14.21 

Proof. We use induction on J. For J — 1, the result is obvious. Suppose then that it 
is established for J > 1. Choose T 1; . . . ,Tj +1 from a family of admissible deformations, 
and fix Ai, . . . , Xj+i with X//=i A j = Using the definition of the iterated barycenter, 
we have 



IB((/j lj ,\ j ) l < :j <j +1 ) = Bar IB (l^j, Xj)i<j<j),^2\j , (//j+i,A 



J+i, 



Bar 



where we set Aj = X]/=i A r 

Set v = £^ =1 XjTj)#fi. As /i J+ i = T J+ i # /i, we have also = (T j+ i)^Vj+i ; and 



a ; N 

3 

1 J 



*A 



Now, observe that by assumption all the maps Tj o (Tj +1 ) _1 are gradients of convex 
functions, so that their convex combination also is. By Brenier's theorem, the map 

3 j=l 

is the Brenier map from to v. We deduce that the barycenter of v and /ij+i is 

(A J+ iId + AjT) # /i J+ i 
= (A,/ + iTj + i + AjT o T J+ i) # // 
J+i 

3=1 

This finishes the proof. 

□ 



Proof of Proposition 4.3 
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Proof. Set T(xi, . . . , xj) = Ylj=i ^jXj for x±, . . . ,Xj G M. d . Proposition 4.2 of [T] claims 
that the barycenter of (fij, Xj)i<j<j, denoted by /ie, satisfies /ig = T#7 where 7 G 
V((R d ) J ) is the unique solution of the optimization problem 



infjy ^A.lTOr)-^! 2 ^ 



xi,...,xj), 7 g n(/ii, . . . ,/xj) 

where II (/xi, . . . is the set of probability measures on IR dJ with j-th marginal fij, 
1 < j < J. This can be rewritten as 



inf 



/ ^2\^j\x i -x j \ 2 d'y(x 1 ,...,xj), 7 G n(/ii, . . . > . 



The integral is bounded below by Yldj=i K^jW^Lii, Lij) (because each term of the sum 
is bounded by W^Lii, fij)). On the other hand, choosing 

7 = (Ti, ■ ■ ■ ,Tj) # n, 
we see that 7 G II(/i 1 , . . . and that 



— £j| 2 G?7 = y |^(x) — Ti(x)\ 2 dfi(x) = j [Tj o Tj — x| 2 /ij(c?x) = W^i/J-i, fJ>j)- 
Thus 7 is optimal, and we have 

j 



□ 



Proof of Theorem 14.41 



Proof. Using the results of Corollary 4.5, we get that 



w 2 (fi B ,fi)< [ \-j2 T ^ x )- x \ 2 ^ da 
J J .7=1 



Almost sure convergence towards of j Y2j=i(Tj — id) is directly deduced from Corollary 
7.10 (p. 189) in [18J, which is an extension of the Strong Law of Large Numbers to Banach 
spaces. Then the result follows from dominated convergence. 

Likewise, obtaining error bounds is straightforward. Assuming that \\T — id\\L2 < M 
a.s., we can use Yurinskii's version of Bernstein's inequality in Hilbert spaces (|2Z], P- 
491) to get the result announced. □ 



Proof of Proposition 5.1 
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Proof. By the empirical law of large numbers, /i" 



fij weakly. Moreover W^jjij , /x") < -, 

we obtain 



3.2 
T, 



where the 



so that /i™ — > fij in Wi metric, see for instance in [26] . Hence using Theorem 
the consistency of the barycenter. Moreover, if fij = Tj„fi for all j — 1, . . . 
Tj's are an admissible family of bounded deformations, hence Theorem 4.4 enables to get 
that 

W 2 (fJ% ,Hb) — > 



when both n and J goes to infinity. 



□ 
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